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University of Evry 

Complex systems in nature and in society are often represented 
as networks, describing the rich set of interactions between objects 
of interest. Many deterministic and probabilistic clustering methods 
have been developed to analyze such structures. Given a network, al- 
most all of them partition the vertices into disjoint clusters, accord- 
ing to their connection profile. However, recent studies have shown 
that these techniques were too restrictive and that most of the ex- 
isting networks contained overlapping clusters. To tackle this issue, 
we present in this paper the Overlapping Stochastic Block Model. 
Our approach allows the vertices to belong to multiple clusters, and, 
to some extent, generalizes the well-known Stochastic Block Model 
[Nowicki and Snijders (2001)]. We show that the model is generically 
identifiable within classes of equivalence and we propose an approxi- 
mate inference procedure, based on global and local variational tech- 
niques. Using toy data sets as well as the French Political Blogosphere 
network and the transcriptional network of Saccharomyces cerevisiae, 
we compare our work with other approaches. 

1. Introduction. Networks have been extensively studied ever since the 
work of Moreno (1934). They are used in many scientific fields to represent 
the interactions between objects of interest. For instance, in Biology, reg- 
ulatory networks can describe the regulation of genes with transcriptional 
factors [Milo et al. (2002)], while metabolic networks focus on representing 
pathways of biochemical reactions [Lacroix, Fernandes and Sagot (2006)]. In 
the social sciences, networks are commonly used to represent relational ties 
between actors [Snijders and Nowicki (1997); Nowicki and Snijders (2001)]. 

In this context, many deterministic and probabilistic clustering methods 
have been used to acquire knowledge from the network topology. As shown in 
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Newman and Leicht (2007), most of these techniques seek specific structures 
in networks. Thus, some models look for community structure where vertices 
are partitioned into classes such that vertices of a class are mostly connected 
to vertices of the same class [Hofman and Wiggins (2008)]. They are partic- 
ularly suitable for the analysis of affiliation networks [Latouche, Birmele and 
Ambroise (2009)]. Most existing community discovery algorithms are based 
on the modularity score of Girvan and Newman (2002). However, Bickel and 
Chen (2009) showed that these algorithms were (asymptotically) biased and 
that using modularity scores could lead to the discovery of an incorrect com- 
munity structure, even for large graphs. The model of Handcock, Raftery 
and Tantrum (2007) which extends Hoff, Raftery and Handcock (2002) is an 
alternative approach. Vertices are clustered depending on their positions in 
a continuous latent space. They proposed a Bayesian inference procedure, 
based on Markov Chain Monte Carlo (MCMC), which is implemented in the 
R package latentnet [Krivitsky and Handcock (2009)], as well an asymptotic 
BIC criterion. Other models look for disassortative mixing in which vertices 
mostly connect to vertices of different classes. They are commonly used to 
analyze bipartite networks [Estrada and Rodriguez- Velazquez (2005)] which 
are present in many applications. For more details, see Newman and Leicht 
(2007). 

The Stochastic Block Model (SBM) can uncover heterogeneous structures 
in a large variety of networks [Latouche, Birmele and Ambroise (2009)] . Orig- 
inally developed in the social sciences, SBM is a probabilistic generalization 
[Fienberg and Wasserman (1981); Holland, Laskey and Leinhardt (1983)] of 
the method described in White, Boorman and Breiger (1976). Given a net- 
work, it assumes that each vertex belongs to a latent class among Q classes 
and uses &Q xQ connectivity matrix II to describe the connection probabil- 
ities [Frank and Harary (1982)]. No assumption is made on II such that SBM 
is a very flexible model. In particular, it can be used, among others, to look 
for community structure and disassortative mixing. Many inference meth- 
ods have been employed to estimate the SBM parameters. They all face the 
same problem. Indeed, contrary to Gaussian mixture models or other usual 
mixture models, the posterior distribution p(Z|X), of all the hidden label 
variables, given the observation X, cannot be factorized due to conditional 
dependency. Nowicki and Snijders (2001) proposed a Bayesian probabilistic 
approach. Their algorithm is implemented in the software BLOCKS, which 
is part of the package StoCNET [Boer et al. (2006)]. It uses Gibbs sampling 
to approximate the posterior distributions and leads to accurate a posteri- 
ori estimates. Two model based criteria have been proposed to choose the 
optimal value of Q. Thus, Daudin, Picard and Robin (2008) used an ICL 
criterion, based on a Laplace approximation of the Integrated Classification 
Likelihood, while Latouche, Birmele and Ambroise (2009) used a nonasymp- 
totic approximation of the marginal likelihood. For an extensive discussion 
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on statistical network models and blockmodel selection, we refer to Golden- 
berg et al. (2010). 

A drawback of existing graph clustering techniques is that they all par- 
tition the vertices into disjoint clusters, while lots of objects in real world 
applications typically belong to multiple groups or communities. For in- 
stance, many proteins, so-called moonlighting proteins, are known to have 
several functions in the cells [Jeffery (1999)], and actors might belong to sev- 
eral groups of interests [Palla et al. (2005)]. Thus, a graph clustering method 
should be able to uncover overlapping clusters. This issue has received grow- 
ing attention in the last few years, starting with an algorithmic approach 
based on small complete sub-graphs developed by Palla et al. (2005) and 
implemented in the software CFinder [Palla et al. (2006)]. They defined a k- 
clique community as a union of all fc-cliques (complete sub-graphs of size k) 
that can be reached from each other through a series of adjacent 2 fc-cliques. 
Given a network, their algorithm first locates all cliques and then identifies 
the communities using a clique-clique overlap matrix [Everett and Borgatti 
(1998)]. By construction, the resulting communities can overlap. In order to 
select the optimal value of k, the authors suggested a global criterion which 
looks for a community structure as highly connected as possible. Small val- 
ues of k lead to a giant community which smears the details of a network by 
merging small communities. Conversely, when k increases, the communities 
tend to become smaller, more disintegrated, but also more cohesive. There- 
fore, they proposed a heuristic which consists in running their algorithm for 
various values of k and then to select the lowest value such that no giant 
community appears. 

More recent work [Airoldi et al. (2008)] proposed the Mixed Member- 
ship Stochastic Block model (MMSB) which has been used with success to 
analyze networks in many applications [Airoldi et al. (2007); Airoldi et al. 
(2006)]. They used variational techniques to estimate the model parame- 
ters and proposed a criterion to select the number of classes. As detailed in 
Heller, Williamson and Ghahramani (2008), mixed membership models, as 
Latent Dirichlet Allocation [Blei, Ng and Jordan (2003)], are flexible models 
which can capture partial membership [Griffiths and Ghahramani (2005); 
Heller and Ghahramani (2007)], in the form of attribute-specific mixtures. 
In MMSB, a mixing weight vector 7Tj is drawn from a Dirichlet distribution 
for each vertex in the network, 7Tj g being the probability of vertex i to belong 
to class q. The edge probability from vertex i to vertex j is then given by 
Pij — ^i^j^^i^-i-, where B is a Q x Q matrix of connection probabilities 
similar to the II matrix in SBM. The vector is sampled from a multi- 
nomial distribution A4(l, 7Tj) and describes the class membership of vertex i 
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in its relation toward vertex j . By symmetry, the vector Zj^-j is drawn from 
a multinomial distribution A4(l, iVj) and represents the class membership 
of vertex j in its relation toward vertex i. Thus, depending on its relations 
with other vertices, each vertex can belong to different classes and, therefore, 
MMSB can be viewed as allowing overlapping clusters. However, the limit 
of MMSB is that it does not produce edges which are themselves influenced 
by the fact that some vertices belong to multiple clusters. Indeed, for every 
pair of vertices, only a single draw of Z^j and Zj^j determines the 
probability pij of an edge, all the other class memberships of vertex i and 
j toward other vertices in the network do not play a part. In this paper we 
present a complementary approach which tackles this issue. 

Fu and Banerjee (2008) modeled overlapping clusters on Q components by 
characterizing each individual % by a latent {0, 1} vector Z{ of length Q drawn 
from independent Bernoulli distributions. The ith row of the data matrix 
then only depends on Zi. In the underlying clustering structure, i belongs 
to the components corresponding to a 1 in Zj. Nevertheless, the proposed 
model needs Q parameters for each individual and supposes independence 
between rows and columns of the data matrix, which is not the case when 
looking for network structures. 

In this paper we propose a new model for generating networks, depend- 
ing on (Q + l) 2 + Q parameters, where Q is the number of components in 
the mixture. A latent {0, l}-vector of length Q is assigned to each vertex, 
drawn from products of Bernoulli distributions whose parameters are not 
vertex-dependent. Each vertex may then belong to several components, al- 
lowing overlapping clusters, and each edge probability depends only on the 
components of its endpoints. 

In Section 2 we briefly review the stochastic block model introduced by 
Nowicki and Snijders (2001). In Section 3 we present the overlapping stochas- 
tic block model and we show in Section 4 that the model is identifiable within 
classes of equivalence. In Section 5 we propose an EM-like algorithm to infer 
the parameters of the model. Finally, in Section 6 we compare our work with 
other approaches using simulated data and two real networks. We show the 
efficiency of our model to detect overlapping clusters in networks. 

2. The stochastic block model. In this paper we consider a directed bi- 
nary random graph Q represented by an N x N binary adjacency matrix X. 
Each entry X^ describes the presence or absence of an edge from vertex i to 
vertex j. We assume that Q does not have any self loop, and, therefore, the 
variables Xa will not be taken into account. The Stochastic Block Model 
(SBM) introduced by Nowicki and Snijders (2001) associates to each vertex 
of a network a latent variable Zj drawn from a multinomial distribution: 

Zi~ M(l,a= (ai,a 2 ,. ■ ■ ,a Q )), 
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where en denotes the vector of class proportions. As in other standard mix- 
ture models, the vector Zj sees all its components set to zero except one 
such that Zi q = 1 if vertex i belongs to class q. The model then verifies 

Q 

(2.1) E z "? = i Vie{i,...,JV} 

q=l 

and 

Q 

(2.2) E a ^ = L 

9=1 

Finally, the edges of the network are drawn from a Bernoulli distribution: 

Xij\{Z iq Zji = 1} ~B(n q i), 

where II is a Q x Q matrix of connection probabilities. According to this 
model, the latent variables Zi,...,Zjy are i.i.d. and given this latent struc- 
ture, all the edges are supposed to be independent. Note that SBM was orig- 
inally described in a more general setting, allowing any discrete relational 
data. However, as explained previously, we concentrate in the following on 
binary edges only. 

3. The overlapping stochastic block model. In order to allow each vertex 
to belong to multiple classes, we relax the constraints (2.1) and (2.2). Thus, 
for each vertex i of the network, we introduce a latent vector Zj, of Q inde- 
pendent Boolean variables Zi q G {0, 1}, drawn from a multivariate Bernoulli 
distribution: 

Q Q 

(3.1) Zt ~ JJ B(Z iq ; a q ) = JJ a? " (1 - a q f- z ^ . 

q=l q=l 

We point out that Zj can also have all its components set to zero which 
is a useful feature in practice as described in Sections 3.2 and 6. The edge 
probabilities are then given by 

Xij | Z t , Zj ~ B(Xij ; g(a z . , Zj ) ) = e X ^ z > g( -a z , ; , z . ) , 

where 

(3.2) a Zi)Zi = ZjWZj + ZjXJ + \ T Z 3 + W* , 

and g(x) = (1 + e~ x )~ 1 is the logistic sigmoid function. W is a Q x Q real 
matrix, whereas U and V are Q-dimensional real vectors. The first term 
in the right-hand side of (3.2) describes the interactions between the ver- 
tices i and j. If i belongs only to class q and j only to class I, then only 
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Table 1 

The values o/oz ; ,z- in functions ofLi (rows) and Zj (columns) for an overlapping 
stochastic block model with Q — 2 





(0,0) 


(1,0) 


(0,1) 


(1,1) 


(0,0) 


w* 


Vi + w* 


v 2 + w 


V1 + V2 + w* 


(1,0) 


Ui + W* 


Wu + U1 + V1 + W* 


Wu + Ui + V 2 + W* 


Wu + W12 + Ui 










+ Vi + v 2 + w 


(0,1) 


u 2 + w* 


W21 + U2 + VI + W* 


W22 + U2 + V2 + W* 


W21 + W22 + U2 










+ Vi + v 2 + w 


(1,1) U1 + U2 + W* 


Wu + W21 + Ui 


W12 + W22 + Ui 


Wu + W 12 + W21 






+ U 2 + Vi + w 


+ U2 + V2 + W* 


+ W22 + U1 + U2 










+ V! + V 2 + W* 



one interaction term remains (Z^WZj = W q i). However, as illustrated in 
Table 1, the model can take more complex interactions into account if one 
or both of these two vertices belong to multiple classes (Figure 1). Note 
that the second term in (3.2) does not depend on Zj. It models the overall 
capacity of vertex i to connect to other vertices. By symmetry, the third 
term represents the global tendency of vertex j to receive an edge. These 
two parameters U and V are related to the sender /receiver effects Si and 7^ 
in the Latent Cluster Random Effects Model (LCREM) of Krivitsky et al. 
(2009). However, contrary to LCREM, <5j = ZTu and 7,- = V T Zj depend on 
the classes. In other words, two different vertices sharing the same classes 
will have exactly the same sender /receiver effects, which is not the case in 
LCREM. Finally, we use the scalar W* as a bias, to model sparsity. 




Fig. 1. Example of a directed graph with three overlapping clusters. 
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If we associate to each latent variable Zj a vector Zj = (Zj, 1) , then (3.2) 
can be written 

(3-3) a z . z .=Z7WZ,-, 
where 

V v w * ) 

The 's can be seen as random variables drawn from a Bernoulli distri- 

bution with probability ag+i = 1. Thus, one way to think about the model 
is to consider that all the vertices in the graph belong to a {Q + l)th cluster 
which is overlapped by all the other clusters. In the following, we will use 

(3.3) to simplify the notation. 

Finally, given the latent structure Z = {Zi, . . . , Zjv}, all the edges are 
supposed to be independent (see Figure 2). Thus, when considering directed 
graphs without self-loop, the Overlapping Stochastic Block Model (OSBM) 
is defined through the following distributions: 

N Q 

(3.4) P (zia) = n n - ^f-^ 

i=l g=l 

and 

N 

p(X\Z,W) = He X 'J az ^g(-a z ^). 




8 



P. LATOUCHE, E. BIRMELE AND C. AMBROISE 



3.1. Modeling sparsity. As explained in Airoldi et al. (2008), real net- 
works are often sparse 3 and it is crucial to distinguish the two sources of 
noninteraction. Sparsity might be the result of the rarity of interactions in 
general, but it might also indicate that some class (intra or inter) connec- 
tion probabilities are close to zero. For instance, social networks (see Section 
6.2) are often made of communities where vertices are mostly connected to 
vertices of the same community. This corresponds to classes with high intra 
connection probabilities and low inter connection probabilities. In (3.2) we 
can notice that W* appears in ozj.z,- for every pair of vertices. Therefore, 
W* is a convenient parameter to model the two sources of sparsity. Indeed, 
low values of W* result from the rarity of interactions in general, whereas 
high values signify that sparsity comes from the classes (parameters in W, 
U and V). 

3.2. Modeling outliers. When applied on real networks, graph clustering 
methods often lead to giant classes of vertices having low output and input 
degrees [Daudin, Picard and Robin (2008); Latouche, Birmele and Ambroise 
(2009)]. These classes are usually discarded and the analysis of networks 
focus on more highly structured classes to extract useful information. The 
product of Bernoulli distributions (3.4) provides a natural way to encode 
these "outliers." Indeed, rather than using giant classes, OSBM uses the 
null component such that Zj = if vertex i is an outlier and should not be 
classified in any class. 

4. Identifiability. Before looking for an optimization procedure to esti- 
mate the model parameters, given a sample of observations (a network), it is 
crucial to verify whether OSBM is identifiable. A theorem of Allman, Matias 
and Rhodes (2009) lies at the core of the results presented in this section. 

If we denote .F(0) = {P#, € 0}, a family of models we are interested in, 
the classical definition of identifiability requires that for any two different 
values 6 y£ 6' , the corresponding probability distributions P# and F g i are 
different. 

4.1. Correspondence with (nonoverlapping) stochastic block models. Let 
©OSBM be the parameter space of the family of OSBMs with Q classes: 

©OSBM = {(«,W)€[0,l] C3 xM( c 3+i) 2 } . 

Each in Gosbm corresponds to a random graph model which is defined 
by the distribution p(X.\ot, W). The aim of this Section is to characterize 
whether there exists any relation between two different parameters and 0' 
in ©osBM) leading to the same random graph model. 



3 The corresponding adjacency matrices contain mainly zeros. 
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We consider the (nonoverlapping) Stochastic Block Model (SBM) intro- 
duced by Nowicki and Snijders (2001). The model is defined by a set of 
classes C, a vector of class proportions 7 = {7c}ceC verifying Ylcec^ c = ^' 
and a matrix of connection probabilities II = (IIc,d)c Dee 2 - Note that they 
are an infinite number of ways to represent and encode the classes. For 
simplicity, a common choice is to set C = {1, . . . , Q} and possibly C = {C € 
{0, 1}^, Ylq=i = 1}' f° r a m °del with Q classes. The random graphs are 
drawn as follows. First, the class of each vertex is sampled from a multino- 
mial distribution with parameters (l,*y). Thus, each vertex i belongs only 
to one class, and that class is C with probability 7c- Second, the edges 
are drawn independently from each other from Bernoulli distributions, the 
probability of an edge being IIc,d, if * belongs to class C and j to 

class D. 

Let ©sbm be the parameter space of the family of SBMs with 2^ classes: 
9sbm = { (7, n) € [0, if x [0, if Q , 7C = 1} ■ 

Considering that each possible value of the vectors Zj's in an OSBM with 
Q classes encodes a class in a SBM with 2<2 classes (i.e., C = {0, 1} Q ) yields 
a natural function: 

, ©OSBM — > ©SBM 

^ : (a,W)^( 7 ,n)' 

where 

Q 

7c = n«? 9 ( 1 -^) 1 " C? VCG{0,1}« 

q=l 

and 

n c ,D = ff(C T WD + C T U + V T D + w*) 

V(C,D) e {0,1} Q x {0,1} Q . 

Let Qm denote the set of probability measures on the graphs of N vertices. 
The OSBM of parameter 6 in Gosbm and the SBM of parameter 4>(0) in 
©sbm clearly induce the same measure [J, in Gn- Thus, denoting by 1^(7, II) 
the probability measure in Gn induced by the SBM of parameter (7, II), the 
problem of identifiability is to characterize the relations between parameters 
€ ©OSBM and 0' € 9 OS BM such that ?/#(#)) = ^{4>{0')): 

©OSBM ©sbm — > Gn, 
= (a,W)4( 7 ,n)4 fi. 
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The identifiability of SBM was studied by Allman, Matias and Rhodes 
(2009), who showed that the model is generically identifiable up to a per- 
mutation of the classes. In other words, except in a set of parameters which 
has a null Lebesgue measure, two parameters imply the same random graph 
model if and only if they differ only by the ordering of the classes. There- 
fore, the main theorem of Allman, Matias and Rhodes (2009) implies the 
following result. 

Theorem 4.1. There exists a set ©gfjM C ©sbm of null Lebesgue mea- 
sure such that, for every (7, n) and (7', II') not in ©sbm, ^(l, n) = VKt'i n') 
if and only if there exists a function P v such that (7', II') = P u ((-f,Il)) , 
where: 

• v is a 'permutation on {0, 1}^, 

• 7 / c=7,(c),VC€{0,l}« ; ' 

. n' c D = n KC)iKD) , v(c, D) g {0, i}Q x {0, i}« . 

Thus, studying the generical identifiability of the OSBM is equivalent to 
characterizing the parameters of ©osbm verifying 4>{6') = P u ((p(0)) for some 
permutation v on {0,1}^. 

4.2. Permutations and inversions. As in the case of the SBM, reordering 
the Q classes of the OSBM and doing the corresponding modification in en 
and W does not change the generative random graph model. Indeed, let a be 
a permutation on {1, . . . , Q} and let P a denote the function corresponding 
to the permutation a of the classes. Then, (a', W) = P a (a,W) is defined 
by 

a 'q = a <T{q) Vge{l,...,Q}, 

and 

W^W,^,) V(g,Z) e{l,...,Q + l} 2 . 

Now, let v be the permutation of {0, 1}^ defined by 

u(C) = (C a{1) ,...,C a{Q) ) VCe{0,l}«. 

It is then straightforward to see that, for every parameter 6 in ©osbm and 
every permutation a, (f)(P a (0)) = P v ((f)(9)), where P v is defined in Theo- 
rem 4.1. 

There is another family of operations in ©osbm which does not change the 
generative random graph model, which we call inversions. They correspond 
to exchanging the labels to 1 and vice versa on some of the coordinates of 
the Zi's. To give an intuition, consider a parameter = (a, W) in ©osbm- 
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Let us generate graphs under the probability measure in Qn induced by 9 
and consider only the first coordinate of the Z^s. If we denote by "cluster 1" 
the vertices whose Zj's have a 1 as first coordinate, the graph sampling pro- 
cedure consists in sampling the set "cluster 1" and then drawing the edges 
conditionally on that information. Note that it would be equivalent to sample 
the vertices which are not in "cluster 1" and to draw the edges condition- 
ally on that information. Thus, there exists an equivalent reparametrization 
where the l's in the first coordinate correspond to the vertices which are 
not in "cluster 1." This is the parameter 9' obtained from 9 by an inversion 
of the first coordinate. 

Let A be any vector of {0, 1}^. We define the A-inversion /a as follows: 

T ©OSBM — > ©OSBM 

a: (a,W)->(a',W')' 

where 

3 1 ctj , otherwise 1 ' ' 

and 

W' = ]VIXWMa. 

The matrix Ma is defined by 

M^('- 2di i A) f), 

with diag(A) being the Q x Q diagonal matrix whose diagonal is the vector 
A. 



Proposition 4.1. For every A £ {0,1}^, let v be the permutation of 
{0, 1} Q defined by 



VCG{0,1} Q u{C) 



1 - C h if Ai = 1, 
Ci, otherwise. 



Then, for every 9 in 0OSBM; 

0(Ia(0))=P„(0(0)), 
where P v is defined in Theorem 4-1- 



PROOF. Consider 9 G 0osbm and A G {0, 1} Q and define (7,11) = (f>(9) 
and (7',n') = (p(Ix{9)). It is straightforward to verify that 

70 = 7,(0 VCG{0,1}« 
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'cW„(C)' 



Moreover, since Ma { 1 j = y\ ) j,it follows that 

n / c, D =5((C T l)MlWM A ^jj 

Therefore, </>(Ia(0)) = MH 6 )) • D 

4.3. Identifiability. Let us define the following equivalence relation: 
0~e' if 3a, A 1 6' = I A (P„(0)). 
To be convinced that it is an equivalence relation, note that 

^A -P ( 7=-P<7°^ ( 7-1(A)- 

Consider the set of equivalence classes for the relation ~. It follows that: 

• Two parameters in the same equivalence class induce the same measure 
in Q N . _____ 

• Each equivalence class contains a parameter 6 = (a, W) such that ai < 
q-2 < . . . < a Q < Moreover, if the a^'s are all distinct and strictly lower 
than ^, there is a unique such parameter in the equivalence class. 

We are now able to state our main theorem about identifiability, that is, 
that the model is generically identifiable up to the equivalence relation ~. 

Theorem 4.2. For every cx G ]0, 1[ Q , let (3 G be the vector defined by 
Pk = ~ ln(Y^), for every k. 

Define ©qsbm as ^ e se ^ of parameters (a, W) such that one of the fol- 
lowing conditions holds: 

• there exists 1 < k < Q such that a k = or a^ = \ or ak = \, 

• there exist 1 < k,l < Q such that a k = a_ ; 

• there exist C, D G {0, 1} Q x {0, 1} Q such that Y^k/ 3 kC k = J2kl 3 k D k, 

• (f>(cx, W) G @SBM' se ^ °f nu H measure given by Theorem 4-1- 
Then Sqsbm has a null Lebesgue measure on Oosbm o,nd 

v0,0 / G(e O sBM\e o a s d BM) 2 m = <P(o') o~e'. 

Proof, ©osbm * s the umon of a finite number of hyperplanes or spaces 
which are isomorphic to hyperplanes. Therefore, /■'(©osbm) = 0- ^ 
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Let = (a,W), ff / = (a',W / ), 0(0) = (7,n) and <p{6') = (7', II'). As <f> 
is constant on each equivalence class and as and 6' are not in 6qsbm' we 
can assume that < a\ < ■ ■ ■ < a>k < \ and < a[ < ■ ■ ■ < a' k < \. Proving 
the theorem is then equivalent to proving that 6 = 6'. 

As 4>(6) = 4>(6 f ), Theorem 4.1 ensures that there exists a permutation 
v : {0, 1} Q -> {0, 1} Q such that 



7c=7i/(C) VC, 
n c,D = n </>(c),»MD) VC ' D 



Then, in particular, 



(4.1) 



The minima of those two sets as well as the second lowest values are equal, 
that is, 

n ak = n ^ and ( n ak ) ^ ~ ^ = ( n ) ^ ~ a 'o^- 

Dividing those equations term by term yields = and finally oq = 

Oq. Dividing all terms by ag Q (l — oq) 1 <2 in (4.1), by induction, it follows 
that 

(4.2) a = a'. 

Now, for any C G {0, 1}^, the fact that -y' c = 7„(c) can De written as 

n «? (i - a,) 1 -^ = n a - «*) i - v(c °* > 

fc k 
£ Cfc ln (l^) + S ln (! " «*) = E "(30 In (^) + £ " a*) 

k k 

Since 6 ^ @osbmi this i m phes that v(C) = C. As it is true for every C, v is 
in fact the identity function. 

Therefore, for every C,D, IIc,d = hlc D' that is, 

£ "tyCgdj + £ n <> c <? + + w * = £ + E U 9 C ? + E v ^ + w '* • 

g,Z q I q : l q I 
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Applying it for C = D = implies W* = W 
Applying it for D = and C = S q , where S q is the vector having a 1 on 
the qth. coordinate and O's elsewhere yields u q + W* = u' q + W* and, thus, 



u' q . 



By symmetry, C = and D = Si implies vi = v t . 
Finally, C = 5 q and D = 5i gives W ql = W' qV 
Thus, 

(4.3) W = W'. 

By equations (4.2) and (4.3), we have 6 = 6'. 

5. Statistical inference. Given a network, our aim in this section is to 
estimate the OSBM parameters. 

The log-likelihood of the observed data set is defined through the marginal- 
ization: p(X|ct, W) = ^zp(X, Z|a,W). This summation involves 2 N ® terms 
and quickly becomes intractable. To tackle this issue, the Expectation- 
Maximization (EM) algorithm has been applied on many mixture models. 
However, the E-step requires the calculation of the posterior distribution 
p(Z|X, a, W) which cannot be factorized in the case of networks [see Daudin, 
Picard and Robin (2008) for more details]. In order to obtain a tractable 
procedure, we present some approximations based on global and local vari- 
ational techniques. 

5.1. The q-transformation. Given a distribution (?(Z), the log-likelihood 
of the observed data set can be decomposed using the Kullback-Leibler 
divergence KL(-||-): 

(5.1) lnp(X|a,W) = £(g;a,W)+KL(g(-)||p(-|X,a,W)), 
where 

(5.2) £(g;a ,w) = £,(z)m{ gg^g) } 



and 
(5.3) 



KL(<,(. )M .|X,c W» = -Eg(Z) In{ P(Z| ^ W> 



The maximum lnp(X|ci!,W) of the lower bound C (5.2) is reached when 
q(Z) = p(Z|X, a, W). Thus, if the posterior distribution p(Z|X, a, W) was 
tractable, the optimizations of C and lnp(X|o;, W), with respect to a and 
W, would be equivalent. However, in the case of networks, p(Z|X, a,W) 
cannot be calculated and C cannot be optimized over the entire space of 
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q(Z) distributions. Thus, we restrict our search to the class of distributions 
which satisfy 

N 

(5-4) ?(Z) = n?(Zi)> 

1=1 

with 

Q Q 
q(Zi) =l[B(Z iq ;T iq ) = flrf (1 - r^) 1 "^. 

q=l q=l 

Each Ti q is a variational parameter which corresponds to the posterior prob- 
ability of node i to belong to class q. As for the vector a, the vectors 
t% = {t~h, ■ ■ ■ , Tiq} are not constrained to lie in the (Q — l)-dimensional sim- 
plex. 

Proposition 5.1. [Proof in Latouche, Birmele and Ambroise (2010), 
Appendix A]. The lower bound of the observed data log-likelihood is given by 

N 

C{q- a, W) = ^{A- v t/ Wt, + E Zl , Zj [hxg{-a^)}} 
N Q 

(5.5) + ^^{rj g lnag + (1 - r i(? )m(l - a q )} 

i=l 9=1 

N Q 

~ ^ ^{^Q lnT «/ + (1 - Tig) In (1 - T ig )}. 

i=l q=l 

Unfortunately, since the logistic sigmoid function is nonlinear, 
Ez 4 ,z 3 [hig(— az^Zj)] in (5.5) cannot be computed analytically. Thus, we 
need a second level of approximation to optimize the lower bound of the 
observed data set. 

5.2. ^-transformation. 

Proposition 5.2 [Proof in Latouche, Birmele and Ambroise (2010) in 
Appendix A]. Given a variational parameter ^ , Ez^z.,- [hig(— az i; Zj)] sat- 
isfies 

Ez ! ,z J [lng(-a Zi ,z j )] 

(5.6) 



> mg(^) - £Lgg* + fa) _ Ai-HK, .z. [(ZjWZjf] - 
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Eventually, a lower bound of the first lower bound can be computed: 
(5.7) mp(X|a,W) >£(q;cc,W) >£(<?;«, W,£), 

where 

C(q;a, W,£) = ^| (x {j - ^jfJWfj + lng(&) - ^ 

- A(e ii )(Tr(W T E;W5] j ) + fTw^Wr,,- - £?•)} 

W Q 

+ ^^{r i9 lna g + (1 - r«,)ln(l - a g )} 

i=l (j=l 
AT Q 

- Yl J2i T ig lxiT n + ( X ~ T *l) H 1 ~ T iq)}- 

i=l q=l 

The resulting variational EM algorithm (see Algorithm 1) alternatively 
computes the posterior probabilities Tj and the parameters a and W max- 
imizing 

max£(g;a!,W,£). 

The optimization equations are given in Latouche, Birmele and Ambroise 
(2010), Appendix B. 

The computational cost of the algorithm is equal to 0(N 2 Q 4 ). For com- 
parison the computational cost of the methods proposed by Daudin, Picard 
and Robin (2008) and Latouche, Birmele and Ambroise (2009) for (nonover- 
lapping) SBM is equal to 0{N 2 Q 2 ). Analyzing a sparse network with 100 
nodes takes about ten seconds on a dual core, and about a minute for dense 
networks. 

For all the experiments we present in the following section, set a 2 = 0.5 
and we used the Ascendant Hierarchical Classification (AHC) algorithm 
implemented in the R package "mixer" which is available at the following: 
http : //cran . r-pro j ect . org/web/packages/mixer. 

6. Experiments. We present some results of the experiments we carried 
out to assess OSBM. Throughout our experiments, we compared our ap- 
proach to SBM (the nonoverlapping version of OSBM), the Mixed Member- 
ship Stochastic Block model (MMSB) of Airoldi et al. (2008), and the work 
of Palla et al. (2005), implemented in the software (Version 2.0.1) CFinder 
[Palla et al. (2006)]. 
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Algorithm 1 Overlapping stochastic block model for directed graphs without 
self loop. 

// INITIALIZATION 

Initialize t with an Ascendant Hierarchical Classification algorithm Sample W from a 
zero mean 

a 2 spherical Gaussian distribution 
// OPTIMIZATION 
repeat 

/ / ^-transformation 
for € V do 

& <- ^/Tr(W T E^WSj) + f/WT^Wf, 

end 

// M-step 

for q = 1 : Q do 



Optimize £(g; a,W,£) with respect to W, with a gradient based optimization al- 
gorithm 

[e.g., quasi-Newton method of Broyden et al. (1970)] 

// E-step 

repeat 

for i = 1 : N do 

Optimize C(q;a, W,£) with respect to n, with a box constrained (n q £ 

[0,1]) 

gradient based optimization algorithm [e.g., Byrd method Byrd et al. (1995)] 

end 

until r converges 
until C{q; a, W, £) converges 



In order to perform inference in SBM, we used the variational Bayes al- 
gorithm of Latouche, Birmele and Ambroise (2009) which approximates the 
posterior distribution over the latent variables and model parameters, given 
the edges. We computed the Maximum A Posteriori (MAP) estimates and 
obtained the class membership vectors Zj. We recall that SBM assumes that 
each vertex belongs to a single class and, therefore, each vector Zj has all its 
components set to zero except one, such that Zi q = 1 if vertex i is classified 
into class q. For OSBM, we relied on the variational approximate inference 
procedure described in Section 5 and computed the MAP estimates. Con- 
trary to SBM, each vertex can belong to multiple clusters and, therefore, the 
vectors Zj can have multiple components set to one. As described in Sec- 
tion 1, MMSB can also be viewed as allowing overlapping clusters. For more 
details, we refer to Airoldi et al. (2008). In order to estimate the MMSB 



18 



P. LATOUCHE, E. BIRMELE AND C. AMBROISE 



mixing weight vectors 7Tj, we used the collapsed Gibbs sampling approach 
implemented in the R package Ida [Chang (2010)]. We then converted each 
vector 7Tj into a binary membership vector Zj using a threshold t. Thus, 
for TTi q > t, we set Zi q = 1 and Zi q = otherwise. In all the experiments 
we carried out, we defined t = 1/Q and we found that for higher values 
MMSB tended to behave like SBM. Finally, we considered CFinder which is 
a widely used algorithmic approach to uncover overlapping communities. As 
described in Section 1, CFinder looks for fc-clique communities where each 
A>clique community is a union of all fc-cliques (complete sub-graphs of size k) 
that can be reached from each other through a series of adjacent fc-cliques. 
The algorithm first locates all cliques and then identifies the communities 
and overlaps between communities using a clique-clique overlap matrix [Ev- 
erett and Borgatti (1998)]. Vertices that do not belong to any /c-clique are 
seen as outliers and not classified. 

Contrary to OSBM (and CFinder), SBM and MMSB cannot deal with 
outliers. Therefore, to obtain fair comparisons between the approaches, when 
OSBM was run with Q classes, SBM and MMSB were run with Q + l classes 
and we identified the class of outliers. In practice, this can easily be done 
since this class contains most of the vertices of the network having low output 
and input degrees. 

The code implementing all the experiments is available upon request. 

6.1. Simulations. In this set of experiments we generated two types of 
networks using the OSBM generative model. In Section 6.1.1 we sampled 
networks with community structures (Figure 3) , where vertices of a commu- 
nity are mostly connected to vertices of the same community. To limit the 
number of free parameters, we considered the Q x Q real matrix W: 

( A -e ... -e\ 

-e A ! 



(6.1) 



W 



A/ 



In Section 6.1.2 we generated networks with more complex topologies, 
using the matrix W: 



■ 2) 



W 



A -e ... 

-A -e ... 

-e A A 

-e -A 

: — e 



\ 



—e 
A 

-A 
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Fig. 3. Example of a network with community structures. Overlaps are represented in 
black and outliers in gray. 

In these networks, if class i is a community and has therefore a high intra 
connection probability, then its vertices also highly connect to vertices of 
class i + 1 which itself has a low intra connection probability. Such star 
patterns (Figure 4) often appear in transcription networks, as shown in 
Section 6.3, and protein-protein interaction networks. 



For these two sets of experiments, we used the Q-dimensional real vectors 
U and V: 



and we set Q = 4, A = 4, e = l, W* = —5.5. Moreover, for the vector a 
of class probabilities, we set a q = 0.25, Vg S {1, ...,Q}. We generated 100 
networks with N = 100 vertices and for each of these networks, we clustered 
the vertices using CFinder, SBM, MMSB and OSBM. Finally, we used a 
criterion similar to the one proposed by Heller and Ghahramani (2007); 
Heller, Williamson and Ghahramani (2008) to compare the true Z and the 
estimated Z clustering matrices. Thus, for each network and each method, 
we computed the L2 distance d(P,P) where P = ZZ T and P = ZZ T . These 
two N x N matrices are invariant to column permutations of Z and Z and 
compute the number of shared clusters between each pair of vertices of a 
network. Therefore, (f(P,P) is a good measure to determine how well the 
underlying cluster assignment structure has been discovered. Since CFinder 
depends on a parameter k (size of the cliques), for each simulated network, 
we ran the software for various values of k and selected k for which the L2 



(6.3) 



U = V = (e 
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Fig. 4. Example of a network with community structures and stars. Overlaps are repre- 
sented in black and outliers in gray. 

distance was minimized. Note that this choice of k tends to overestimate 
the performances of CFinder compared to the other approaches. Indeed, in 
practice, when analyzing a real network, k needs to be estimated (see Section 
6.2), while P is unknown. OSBM was run with Q classes, whereas SBM and 
MMSB were run with Q + 1 classes. For both SBM and MMSB, and each 
generated network, after having identified the class of outliers, we set the 
latent vectors of the corresponding vertices to zero (null component). The 
L2 distance d(P,P) was then computed exactly as described previously. 

6.1.1. Networks with community structures. The results that we obtained 
are presented in Table 2 and in Figure 5. We can observe that CFinder, 
MMSB and OSBM lead to very accurate estimates Z of the true clustering 

Table 2 

Comparison of CFinder, SBM, MMSB and 
OSBM in terms of the L2 distance d(P,P) 
over the 100 samples of networks with 
community structures 





Mean 


Median 


Min 


Max 


CFinder 


43.53 


22 





203 


SBM 


116.46 


103.3 





321 


MMSB 


53.76 


27.5 





293 


OSBM 


41.83 








258 



OVERLAPPING STOCHASTIC BLOCK MODELS 



21 




CFinder SBM MMSB OSBM 

Fig. 5. L2 distance d(P,P) over the 100 samples of networks with community struc- 
tures, for CFinder, SBM, MMSB and OSBM. Measures how well the underlying cluster 
assignment structure has been retrieved. 

matrix Z. For most networks, they retrieve the clusters and overlaps per- 
fectly, although CFinder and MMSB appear to be slightly biased. Indeed, 
while the median of the L2 distance d(P,P) over the 100 samples is null for 
OSBM, it is equal to 22 for CFinder and 27.5 for MMSB. Since CFinder is 
an algorithmic approach, and not a probabilistic model, it does not classify 
a vertex Vi if it does not belong to any fe-cliques of a /c-clique community. 
Conversely, OSBM is more flexible and can take the random nature of the 
network into account. Indeed, the edges are assumed to be drawn randomly, 
and, given each pair of vertices, OSBM deciphers whether or not they are 
likely to belong to the same class, depending on their connection profiles. 
Therefore, OSBM can predict that Vi belongs to a class q, although it does 
not belong to any ^-cliques. Overall, we found that MMSB retrieves the clus- 
ters well but often misclassifies some of the overlaps. Thus, if a given vertex 
belongs to several clusters, it tends to be classified by MMSB into only one 
of them. Nevertheless, the results clearly illustrate that MMSB improves 
over SBM, which cannot retrieve any of the overlapping clusters. It should 
also be noted that CFinder has fewer outliers (Figure 5) than MMSB and 
OSBM and appears to be slightly more stable when looking for overlapping 
community structures in networks. 

6.1.2. Networks with community structures and stars. In this set of ex- 
periments we considered networks with more complex topologies. As shown, 
in Table 3 and in Figure 6, the results of CFinder dramatically degrade while 
those of OSBM remain more stable. Indeed, the median of the L2 distances 
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Table 3 

Comparison of CFinder, SBM, MMSB and 
OSBM in terms of the L2 distance d(P,P) over 
the 100 samples of networks with community 
structures and stars 





Mean 


Median 


Min 


Max 


CFinder 


362.07 


354.5 


181 


567 


SBM 


134.68 


118.87 


15.14 


352.09 


MMSB 


119.01 


98.5 





367 


OSBM 


77 


43 





328 



d(P,P) over the 100 samples is equal to 43 for OSBM, while it is equal to 
354.5 for CFinder. This can be easily explained since CFinder only looks 
for community structures of adjacent fc-cliques, and cannot retrieve classes 
with low intra connection probabilities. Conversely, OSBM uses a Q x Q real 
matrix W and two real vectors U and V of size Q to model the intra and 
inter connection probabilities. No assumption is made on these matrix and 
vectors such that OSBM can take heterogeneous and complex topologies 
into account. As for CFinder, the results of MMSB degrade, although they 
remain better than SBM. As for the previous Section, MMSB retrieves the 
clusters well but misclassifies the overlaps more frequently when considering 
networks with community structures and stars. 




o 

in ~ 

o - 



CFinder SBM MMSB OSBM 

Fig. 6. Li distance d(P,P) over the 100 samples of networks with community structures 
and stars, for CFinder, SBM, MMSB and OSBM. Measures how well the underlying cluster 
assignment structure has been retrieved. 
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6.2. French political blogosphere. We consider the French political blo- 
gosphere network and we focus on a subset of 196 vertices connected by 
2864 edges. The data consists of a single day snapshot of political blogs au- 
tomatically extracted on the 14th of October 2006 and manually classified 
by the "Observatoire Presidentielle project" [Zanghi, Ambroise and Miele 
(2008)]. Nodes correspond to hostnames and there is an edge between two 
nodes if there is a known hyperlink from one hostname to another. The 
four main political parties which are present in the data set are the UMP 
(french "republican"), UDF ("moderate" party), liberal party (supporters 
of economic-liberalism) and PS (french "democrat"). Therefore, we applied 
our algorithm with Q = 4 clusters and we obtained the results presented in 
Figure 7 and Table 4. 

First, we notice that the clusters we found are highly homogeneous and 
correspond to the well-known political parties. Thus, cluster 1 contains 35 
blogs among which 33 are associated to UMP, while cluster 2 contains 39 
blogs among which 30 are related to UDF. Similarly, it follows that cluster 3 
corresponds to the liberal party and cluster 4 to PS. We found nine overlaps. 
Thus, three blogs associated to UMP belong to both clusters 1 (UMP) and 
2 (UDF). This is a result we expected since these two political parties are 
known to have some relational ties. Moreover, a blog associated to UDF 
belongs to both clusters 1 (UMP) and 4 (PS), while another UDF blog 
belongs to clusters 2 (UDF) and 4 (PS). This can be easily understood since 





UMP 


UDF 


liberal 


PS 


analysts 


others 


cluster 1 


30 + 3 


+ 1 








+ 1 





cluster 2 


2 + 3 


29 + 1 








1 +3 





cluster 3 








24 





1 + 1 





cluster 4 





+ 2 





40 


+ 4 


1 


outliers 


5 


1 


1 


17 


5 


30 



Fig. 7. Classification of the blogs into Q = 4 clusters using OSBM. The entry (i,j) of 
the matrix describes the number of blogs associated to the jth political party (column) and 
classified into cluster i (row). Each entry distinguishes blogs which belong to a unique 
cluster from overlaps (single membership blogs + overlaps). The last row corresponds to 
the null component. 
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____ Table 4 

The estimated W matrix for the classification of the blogs 
into Q = 4 clusters using OSBM. The 4x4 matrix on the top 
left-hand side represents the W matrix, while the vectors on 
the top right-hand side and bottom left-hand side represent 
the vectors U and V T respectively. The remaining term 
corresponds to the bias. The diagonal o/W indicates that 
blogs have a heavy tendency to connect to blogs of the same 
class. Blogs of cluster 1 (UMP) have also a positive tendency 
to connect to blogs of clusters 2 (UDF) and 3 (liberal party). 
Conversely, blogs of cluster 4 (PS), representing the left wing, 
are more isolated in the network 



3.89 


0.17 


0.54 


-0.70 


-0.70 


0.17 


2.47 


-0.40 


-0.84 


0.40 


0.55 


-0.40 


4.43 


-0.85 


-0.38 


-0.70 


-0.84 


-0.85 


1.66 


0.87 


-0.70 


0.40 


-0.38 


0.87 


-3.60 



UDF is a moderate party. Therefore, it is not surprising to find UDF blogs 
with links with the two biggest political parties in France, representing the 
left and right wings. Very interestingly, among the nine overlaps we found, 
four of them correspond to blogs of political analysts. Thus, a blog overlaps 
clusters 1 (UMP) and 4 (PS). Another one overlaps clusters 2 (UDF), 3 
(liberal party) and 4 (PS). Finally, the two last blogs of political analysts 
overlap clusters 2 (UDF) and 4 (PS). 

We ran CFinder and we used the criterion [Palla et al. (2005)] they pro- 
posed to select k (see Section 1). Thus, we ran the software for various 
values of k and we found k = 7. Lower values lead to giant components 
which smear the details of the network. Conversely, for higher values, the 
communities start disintegrating. Using k, we uncovered 11 clusters which 
correspond to sub-clusters of the clusters we found using OSBM. For in- 
stance, cluster 3 (liberal party) was split into two clusters, whereas cluster 
4 (PS) was split into three. Indeed, while OSBM predicted that the connec- 
tion profiles of these sub-clusters were very similar and therefore should be 
merged, CFinder could not uncover any fc-clique community, that is, a union 
of fully connected sub-graphs of size k, containing these sub-clusters. Note 
that using CFinder, we retrieved the overlaps uncovered by our algorithm. 
CFinder did not classify 95 blogs. 

We also clustered the blogs of the network using MMSB and SBM. As 
previously, for both models, we used Q + 1 clusters and we identified the class 
of outliers. The results of MMSB are presented in Figure 8. Overall, we can 
notice that MMSB led to similar clusters as OSBM, although cluster 4 is 
less homogeneous in MMSB than in OSBM. We found eight overlaps using 
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PS 
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+ 1 
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3 
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Fig. 8. Classification of the blogs into Q — 5 clusters using MMSB. The entry of 
the matrix describes the number of blogs associated to the jth political party (column) and 
classified into cluster i (row). Each entry distinguishes blogs which belong to a unique 
cluster from overlaps (single membership blogs + overlaps). Cluster 5 corresponds to the 
class of outliers. 

MMSB and we emphasize that five of them correspond exactly to the one 
found with our approach. Thus, the model retrieved two among the three 
UMP blogs overlapping clusters 1 (UMP) and 2 (UDF). Moreover, MMSB 
uncovered the UDF blog overlapping clusters 1 (UMP) and 4 (PS), as well 
as the blog of political analysts overlapping clusters 2 (UDF), 3 (liberal 
party) and 4 (PS). It also retrieved the blog of political analysts overlapping 
clusters 1 (UMP) and 4 (PS). Finally, the results of SBM are presented in 
Figure 9. Again, the clusters found by this approach are very similar to the 
one uncovered by OSBM. However, because SBM does not allow each vertex 
to belong to multiple clusters, it misses a lot of information in the network. In 
particular, while some of the blogs of political analysts are viewed as overlaps 
by OSBM, because of their relational ties with the different political parties, 
they are all classified into a single cluster by SBM. 

6.3. Saccharomyces cerevisiae transcription network. We consider the 
yeast transcriptional regulatory network described in Milo et al. (2002) and 
we focus on a subset of 197 vertices connected by 303 edges. Nodes of the 
network correspond to operons, and two operons are linked if one operon 
encodes a transcriptional factor that directly regulates the other operon. 
The network is made of three regulation patterns, each one of them having 
its own regulators and regulated operons. Therefore, using Q = 6 clusters, 
we applied our algorithm and we obtained the results in Table 5. 
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FlG. 9. Classification of the blogs into Q — 5 clusters using SBM. The entry of 
the matrix describes the number of blogs associated to the jth political party (column) and 
classified into cluster i (row). Cluster 5 corresponds to the class of outliers. 

First, we notice that clusters 1, 3 and 5 contain only two operons each. 
These operons correspond to hubs which regulate respectively the nodes of 
clusters 2, 4 and 6, all having a very low intra connection probability. To 
analyze our results, we used GOToolBox [Martin et al. (2004)] on each clus- 



Table 5 

Classification of the operons into Q = 6 clusters. Operons in bold belong to multiple 

clusters 



Cluster 


Size 


Operons 


1 


2 


STE12 TEC1 


2 


33 


YBR070C MID2 YEL033W SRD1 TSL1 RTS2 PRM5 YNL051W PST1 
YJL142C SSA4 YGR149W SP012 YNL159C SFP1 YHR156C YPS1 
YPL114W HTB2 MPT5 SRL1 DHH1 TKL2 PGU1 YHL021C RTA1 
WSC2 GAT4 YJL017W TOSH YLR414C BNI5 YDL222C 


3 


2 


MSN4 MSN2 


4 


32 


CPH1 TKL2 HSP12 SPS100 MDJ1 GRX1 SSA3 ALD2 GDH3 GRE3 
HOR2 ALD3 SOD2 ARA1 HSP42 YNL077W HSP78 GLK1 DOG2 
HXK1 RAS2 CTT1 HSP26 TPS1 TTR1 HSP104 GLOl SSA4 PNC1 
MTC2 YGR086C PGM2 


5 


2 


YAP1 SKN7 


6 


19 


YMR318C CTT1 TSA1 CYS3 ZWF1 HSP82 TRX2 GRE2 SOD1 AHP1 
YNL134C HSP78 CCP1 TALI DAK1 YDR453C TRR1 LYS20 PGM2 
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ter. This software aims at identifying statistically over-represented terms of 
the Gene Ontology (GO) in a gene data set. We found that the clusters 
correspond to well-known biological functions. Thus, the nodes of cluster 
2 are regulated by STE12 and TEC1 which are both involved in the re- 
sponse to glucose limitation, nitrogen limitation and abundant fermentable 
carbon source. Similarly, MSN4 and MSN2 regulate the nodes of cluster 4 in 
response to different stress such as freezing, hydrostatic pressure and heat 
acclimation. Finally, the nodes of cluster 6 are regulated by YAP1 and SKN7 
in the presence of oxygen stimulus. Our algorithm was able to uncover two 
overlapping clusters (operons in bold in Table 5). Interestingly, contrary to 
the other operons of clusters 2, 4 and 6, which are all regulated by operons of 
a single cluster (clusters 1, 3 or 5), these overlaps correspond to co-regulated 
operons. Thus, SSA4 and TKL2 belong to clusters 2 and 4 since they are 
co-regulated by (STE12, TEC1) and (MSN4 and MSN2). Moreover, HSP78, 
CTT1 and PGM2 belong to clusters 4 and 6 since they are co-regulated by 
(MSN4, MSN2) and (YAP1, SKN7). It should also be noted that OSBM did 
not classify 112 operons which all have very low output and input degrees. 

Because the network is sparse, we obtained very poor results with CFinder. 
Indeed, the network contains only one 3-clique and no fc-clique for k > 3. 
Therefore, for k = 2, all the operons were classified into a single cluster and 
no biological information could be retrieved. For k = 3, only three operons 
were classified into a single class and for k > 3 no operon was classified. 

As previously, we ran MMSB and SMB with Q + 1 clusters and we identi- 
fied the class of outliers. Both approaches retrieved the six clusters found by 
OSBM. However, we emphasize that, contrary to the political blogoshpere 
network, MMSB did not uncover any overlap in the yeast transcriptional 
regulatory network. 

As in Section 6.1, these results clearly illustrate the capacity of OSBM 
to retrieve overlapping clusters in networks with complex topological struc- 
tures. In particular, in situations where networks are not made of community 
structures, while the results of CFinder dramatically degrade or cannot even 
be interpreted, OSBM seems particularly promising. 

7. Conclusion. In this paper we proposed a new random graph model, 
the Overlapping Stochastic Block Model, which can be used to retrieve over- 
lapping clusters in networks. We used global and local variational techniques 
to obtain a tractable lower bound of the observed log-likelihood and we de- 
fined an EM like procedure which optimizes the model parameters in turn. 
We showed that the model is identifiable within classes of equivalence and 
we illustrated the efficiency of our approach compared to other methods, 
using simulated data and real networks. Since no assumption is made on 
the matrix W and vectors U and V used to characterize the connection 
probabilities, the model can take very different topological structures into 
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account and seems particularly promising for the analysis of networks. In 
the experiment section we set the number Q of classes using a priori infor- 
mation we had about the networks. However, in future works, we believe it 
is crucial to develop a model selection criterion to estimate the number of 
classes automatically from the topology. We will also investigate introducing 
some priors over the model parameters to work in a full Bayesian framework. 

Acknowledgment. The authors would like to thank C. Matias for her 
helpful remarks and suggestions for the proof on model identifiability. 

SUPPLEMENTARY MATERIAL 

Supplement: Appendix (DOI: 10. 1214/10- AOAS382SUPP; .pdf). Describe 
how global and local variational techniques can be used to obtain a tractable 
lower bound. Introduce the optimization equations for the inference proce- 
dure. 
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